Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring

Authors

  • Lei Jin
  • Sangyeun Cho
Abstract

Private caching and shared caching are the two conventional approaches to managing distributed L2 caches in current multicore processors. Unfortunately, neither shared caching nor private caching guarantees optimal performance under different workloads, especially when many processor cores and cache slices are provided on a switched network. This paper takes a very different approach from the existing hardware-based schemes, allowing data to be flexibly mapped to cache slices at the memory page granularity [4]. Using a profile-guided, execution-driven simulation method, we perform a limit study on performance-optimal two-dimensional page mappings, given a multicore memory hierarchy and on-chip network configuration. Our study shows that a judicious data mapping that improves both the on-chip miss rate and the cache access latency results in significant performance improvement (up to 108%), exceeding the two existing methods. Our result strongly suggests that a well-conceived dynamic data mapping mechanism will achieve similarly high performance on an OS-managed distributed L2 cache structure.

1 Private Caching vs. Shared Caching

Multicore processors have hit the market on all fronts. Processors with two to eight cores are available [5, 8, 10], and they are widely deployed in PCs, servers, and embedded systems. While improving program performance in general, the trend of integrating many cores on a single chip can make performance more sensitive to how the on-chip memory hierarchy is managed, especially at a lower level (e.g., L2 caches). The conventional L2 cache management schemes are private caching and shared caching. In the private caching scheme, a local cache slice always keeps a copy of the accessed data, potentially replicating the same data in multiple cache slices. This replication reduces the effective caching capacity, often leading to more on-chip misses; its benefit is a lower cache hit latency. The shared caching scheme, on the other hand, always maps data to a fixed location. Because there is no replication of data, this scheme achieves a lower on-chip miss rate than private caching. However, the average cache hit latency is larger, because cache blocks are simply distributed across all available cache slices.
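To make the contrast concrete, the following C sketch shows how a physical address might select its home L2 slice under each policy. It is a minimal illustration under assumed parameters (16 slices, 64-byte blocks, 4 KB pages); the page_color table is a hypothetical structure standing in for the OS-managed mapping, not the paper's implementation.

    #include <stdint.h>

    #define NUM_SLICES 16   /* one L2 slice per tile in a 16-core CMP */
    #define BLOCK_BITS 6    /* assumed 64-byte cache blocks */
    #define PAGE_BITS  12   /* assumed 4 KB pages */

    /* Conventional shared caching: interleave at cache-block
     * granularity, so consecutive blocks land on different slices. */
    static inline unsigned slice_block_interleaved(uint64_t paddr) {
        return (unsigned)((paddr >> BLOCK_BITS) % NUM_SLICES);
    }

    /* OS-managed page-granularity mapping: every block of a page
     * shares one slice, chosen by software via a per-page color
     * table (hypothetical, for illustration). */
    extern uint8_t page_color[];    /* indexed by physical page number */

    static inline unsigned slice_page_colored(uint64_t paddr) {
        uint64_t ppn = paddr >> PAGE_BITS;  /* physical page number */
        return page_color[ppn] % NUM_SLICES;
    }

Because the slice choice moves from a fixed address hash to a software-controlled table, the OS is free to place a page near the core that uses it most, or to spread pages across slices for capacity; that degree of freedom is what this paper studies.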
Many proposals have sought to remedy the deficiencies of the two existing schemes. Zhang and Asanović [11] proposed victim replication in a shared L2 cache organization. In their design, L2 cache slices can store replaced cache lines from their local L1 caches in addition to their designated cache lines. Essentially, the local L2 cache slice provides a large victim caching space for the cache lines whose home is remote, similar to [7]. A downside of this approach arises on a local L1 cache miss or an external coherence request: both the L1 cache and the L2 cache must be checked (in parallel or in sequence), because it is not readily known whether a (remote) cache block has been copied into the local L2 cache slice. Chishti et al. [3] proposed a cache design called CMP-NuRAPID, a hybrid of private, per-processor tag arrays and a shared data array. On top of this hardware organization, they studied a series of optimizations, such as controlled replication, in-situ communication, and capacity stealing. Compared with a shared cache organization, however, CMP-NuRAPID requires more complex coherence and cache management hardware. For example, it implements a distributed directory mechanism by maintaining forward and reverse pointers between the private tag arrays and the shared data arrays. Chang and Sohi [2] proposed a cooperative caching framework based on a private cache design with a centralized directory scheme. They studied several optimizations such as cache-to-cache transfer of clean data, replication-aware data replacement, and global replacement of inactive data. Experimental results show that the proposed optimizations effectively limit cache block replication and thus result in a higher on-chip cache hit rate. However, the optimizations come at the expense of a more complex central directory than that of a baseline private cache design.

In summary, the central idea in these works is to balance the best on-chip miss rate (i.e., shared caching) against the best cache access latency (i.e., private caching). Unfortunately, none of these previous works directly answers the key question: what is the optimal point between these two extremes? This work investigates the optimal trade-off between the cache access latency and the on-chip miss rate, given many distributed, non-replicating L2 cache slices. It builds on an OS-managed distributed L2 cache structure [4] and profile-guided, two-dimensional data mapping at the page granularity. In the remainder of this paper, we first briefly discuss the OS-managed L2 cache framework; the profile-guided two-dimensional coloring algorithm and its results are then presented.

2 OS-Managed Distributed L2 Cache

Data distribution becomes a critical performance factor in a multicore processor architecture, especially when a non-uniform cache architecture (NUCA) is employed. Throughout this paper, we assume a tile-based 16-core processor model, as shown in Figure 1.

Figure 1. An example of a 16-core tile-based multicore processor. Each tile consists of a processor core, a private L1 cache, a slice of the global shared L2 cache, and a switch. Communication among tiles is through a mesh-based on-chip interconnection network.

The data distribution granularity in the conventional shared caching method has been the cache block [5, 8, 10], which is determined mainly by the bandwidth requirement. This is no longer optimal in a large-scale NUCA processor, however, where the cache …
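As a rough illustration of profile-guided placement at page granularity, the C sketch below chooses a home slice for one page from per-tile access counts by minimizing total mesh distance weighted by those counts. The 4x4 mesh, the Manhattan-hop cost model, and every name here are illustrative assumptions, not the paper's algorithm, which also optimizes the on-chip miss rate.

    #include <stdint.h>

    #define DIM 4                       /* assumed 4x4 mesh of tiles */
    #define NUM_TILES (DIM * DIM)

    /* Manhattan hop count between two tiles on the mesh. */
    static unsigned hops(unsigned a, unsigned b) {
        unsigned ax = a % DIM, ay = a / DIM;
        unsigned bx = b % DIM, by = b / DIM;
        return (ax > bx ? ax - bx : bx - ax) +
               (ay > by ? ay - by : by - ay);
    }

    /* Pick the home slice for one page: minimize the sum over tiles
     * of (accesses from that tile) x (hops to the candidate home),
     * using access counts gathered from a profiling run. */
    unsigned choose_home(const uint64_t access[NUM_TILES]) {
        unsigned best = 0;
        uint64_t best_cost = UINT64_MAX;
        for (unsigned home = 0; home < NUM_TILES; home++) {
            uint64_t cost = 0;
            for (unsigned t = 0; t < NUM_TILES; t++)
                cost += access[t] * hops(t, home);
            if (cost < best_cost) { best_cost = cost; best = home; }
        }
        return best;
    }

A latency-only policy like this tends to pile hot pages onto a few popular slices, increasing conflict misses there; balancing that capacity pressure against proximity is precisely the two-dimensional trade-off the paper's limit study explores.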

Similar Articles

Improving Performance of Large Physically Indexed Caches by Decoupling Memory Addresses from Cache Addresses

Modern CPUs often use large physically-indexed caches that are direct-mapped or have low associativities. Such caches do not interact well with virtual memory systems. An improperly placed physical page will end up in the wrong place in the cache, causing excessive conflicts with other cached pages. Page coloring has been proposed to reduce conflict misses by carefully placing pages in the ph...

Efficient Cache Locking at Private First-Level Caches and Shared Last-Level Cache for Modern Multicore Systems

Most modern computing systems have multicore processors with multilevel caches for high performance. Caches increase total power consumption and worsen execution time unpredictability. Studies show that way (or partial) cache locking may improve timing predictability and performance-to-power ratio for both single-core and multicore systems. Even though both private first-level and shared ...

Understanding the Limits of Capacity Sharing in CMP Private Caches

Chip Multiprocessor (CMP) systems present interesting design challenges at the lower levels of the cache hierarchy. Private L2 caches allow easier processor-cache design reuse, thus scaling better than a system with a shared L2 cache, while offering better performance isolation and lower access latency. While some private cache management schemes that utilize space in peer private L2 caches ha...

Victim Migration: Dynamically Adapting Between Private and Shared CMP Caches

Future CMPs will have more cores and greater on-chip cache capacity. The on-chip cache can either be divided into separate private L2 caches for each core, or treated as a large shared L2 cache. Private caches provide low hit latency but low capacity, while shared caches have higher hit latencies but greater capacity. Victim replication was previously introduced as a way of reducing the average ...


Published: 2007